What is your intended audience? (E.g., the CEO of Uber, readers of the Star Tribune, people subscribed to the r/dataisbeautiful subreddit, etc. Your audience should not be "students in STAT 336" or "Dr. McNamara.")
Where did your data come from? In broad strokes, what did you need to do in order to clean and visualize it?
Why did you make the design decisions you did? (E.g., mappings in the visualization, color scheme choices, rounding decisions, specific language in a written piece, images on a PowerPoint slide, etc.)
I’m answering all three questions together to keep everything in one document.
I chose to look at vitamin and nutrient intake for a couple of reasons. Mainly, I’ve been really into health and biology my entire life, and I’ve recently lost an additional 35 lb, so I wanted to see what correlates with positive health outcomes. Secondly, it let me tie in my biochemistry project. That project ended up not really applying here outside of the research, but I did get to apply a lot of what I learned in that class to a topic I’m interested in.
When I started, I wasn’t exactly sure what audience I wanted to target. There is a lot of information and detail that would have been hard to explain to a general audience, so I chose to focus on a somewhat more educated audience. That said, the information would be useful for everyone. I tried to write an article like one I might see on the CDC’s website, but with a little more information about the biochemical interactions, since that was the main purpose of my assignment.

The data came from the CDC (link included below); specifically, I used the last 5 years of data from the NHANES survey. I had to do a lot to clean it and make it usable. The day 1 base data frame has 166 columns and the day 2 has 85+. I wrote code that takes any path and reads every file in that folder, including files in any subfolders, which made it easier to merge and combine the data than typing everything out. It was also fun to figure out. I chose to leave the demographics alone for now because one of the files had different headings, and considering how much work it took just to handle the 10 other files, it was best to leave it. That choice did end up complicating things, as there were very few proper discrete variables (I had one, maybe two).
Having only Year to work with as a discrete variable meant I was either stuck plotting over time, which I was fine with, or I needed to wrangle the demographic data tied to each participant and get it assigned properly in the data frame so it was usable. I didn’t do that, mainly for time reasons, but I plan to go back to it. For the plots, I tried a lot. You can see from how many packages I loaded that I was mostly trying different approaches, most of which didn’t turn out well. Plotting 85,000 data points made things unclear; it basically looked like a more creative density plot. This is why I only have column charts.
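To make the step from tens of thousands of rows to a handful of columns concrete, here is a minimal sketch of collapsing participant-level rows into one mean per survey year. The data frame `fake` and the column `vit_c` are made up for illustration and are not the NHANES variables used in this report.

```r
set.seed(1)

# Made-up stand-in for the cleaned data: one row per participant per survey year.
fake <- data.frame(
  Year  = rep(seq(2012, 2020, 2), each = 200),
  vit_c = rnorm(1000, mean = 80, sd = 25)
)

# Collapse to one mean per year; rows with missing values are dropped first.
yearly <- aggregate(vit_c ~ Year, data = fake, FUN = mean)

# Five rows, one per year: this is what a column chart would draw.
print(yearly)
```

Summarizing before plotting means each bar shows a single number per year, which is easier to read than a cloud of individual points.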
The plots I included in my actual paper use the 538 theme from ggthemes. I personally think it looks the best, and it is clear without any distractions. Removing the axis titles makes the charts cleaner; I kept them off with the idea that I would add captions and that the charts would support the text, so the context would still be present. I tweaked the aspect ratio, fill color, and bar width for each chart and set expand to 0,0. I also added a black line at the base of each plot, which in my opinion defines the starting axis and adds a little more structure. I did want to change the limits on the y-axis, but whenever I tried, the data would not display on the chart.
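The styling choices described above can be sketched roughly like this. It is an illustration under assumptions, not the exact code behind the charts in the paper: the `intake` data frame, its values, the fill color, and the aspect ratio are all made up. The last step shows one possible explanation for the disappearing bars: setting `limits` on a y scale drops any data outside that range, and a bar that starts at 0 falls outside a range such as 75 to 85, whereas `coord_cartesian()` only zooms the view.

```r
library(ggplot2)
library(ggthemes)

# Made-up yearly averages, for illustration only.
intake <- data.frame(
  Year        = factor(seq(2012, 2020, 2)),
  mean_intake = c(80.1, 80.9, 81.6, 82.4, 83.0)
)

p <- ggplot(intake, aes(x = Year, y = mean_intake)) +
  geom_col(fill = "steelblue", width = 0.9) +     # wide bars sit almost side by side
  geom_hline(yintercept = 0, colour = "black") +  # black baseline at the bottom of the bars
  scale_y_continuous(expand = c(0, 0)) +          # no padding below or above the bars
  theme_fivethirtyeight() +
  theme(axis.title = element_blank(),             # axis titles off; captions carry the context
        aspect.ratio = 0.6)

# Zoom the y-axis without removing any data from the plot.
p_zoom <- p + coord_cartesian(ylim = c(75, 85))
```

Printing `p` draws the full chart and `p_zoom` draws the zoomed version. Note that zooming hides the baseline at 0, so the two choices pull in different directions.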
The choice to widen the bars came after I first plotted the chart knowing what I was supposed to see, which was a gradual increase (see the “Brown” bar chart below). I couldn’t tell what the chart was trying to say, and the bars all looked about the same. Once I widened them so they sat side by side, it was easier to see what was being shown. The changes in this data were mostly small, and while I wasn’t able to run an ANOVA, the trends can still point toward conclusions. Color was chosen more or less at random, since I didn’t have a good way to display extra variables. Honestly, I chose the colors because I liked them, but they are different enough that the plots don’t look alike. There is a dot plot below with awful coloring that I never changed because I chose not to use it. It originally had a green-to-red gradient that looked like a color blindness test.
Behind The Scenes - My Pain
The data can be downloaded from https://www.cdc.gov/nchs/nhanes/index.htm
Libraries
library(haven)
Warning: package 'haven' was built under R version 4.1.2
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.1.2
Warning: package 'mosaic' was built under R version 4.1.2
Registered S3 method overwritten by 'mosaic':
method from
fortify.SpatialPolygonsDataFrame ggplot2
The 'mosaic' package masks several functions from core packages in order to add
additional features. The original behavior of these functions should not be affected by this.
Attaching package: 'mosaic'
The following object is masked from 'package:Matrix':
mean
The following objects are masked from 'package:dplyr':
count, do, tally
The following object is masked from 'package:purrr':
cross
The following object is masked from 'package:ggplot2':
stat
The following objects are masked from 'package:stats':
binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
quantile, sd, t.test, var
The following objects are masked from 'package:base':
max, mean, min, prod, range, sample, sum
library(moments)
Warning: package 'moments' was built under R version 4.1.2
library(corrplot)
corrplot 0.92 loaded
library(tidyr)
library(dplyr)
library(labelled)
Warning: package 'labelled' was built under R version 4.1.2
Attaching package: 'labelled'
The following object is masked from 'package:ggformula':
set_variable_labels
library(rayshader)
library(plotly)
Warning: package 'plotly' was built under R version 4.1.2
Attaching package: 'plotly'
The following object is masked from 'package:mosaic':
do
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(reshape2)
Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':
smiths
library(scales)
Warning: package 'scales' was built under R version 4.1.2
Attaching package: 'scales'
The following object is masked from 'package:mosaic':
rescale
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
library(ggthemes)
Attaching package: 'ggthemes'
The following object is masked from 'package:mosaic':
theme_map
Current Elapsed Time: 3.848736 Seconds
Data Read From Folders
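A minimal sketch of the recursive read described in the write-up, not the code actually used for this report. `read_folder()` is a hypothetical helper, and the demo uses small made-up CSV files in a temporary folder so that it runs on its own. For the NHANES .XPT files, passing `reader = haven::read_xpt` and `pattern = "\\.xpt$"` would presumably be the equivalent.

```r
# Hypothetical helper: read every matching file under `path`, including subfolders.
read_folder <- function(path, pattern = "\\.csv$", reader = read.csv) {
  # Paths relative to `path`, e.g. "2012/day1.csv"; recursive = TRUE walks subfolders.
  rel <- list.files(path, pattern = pattern, recursive = TRUE, ignore.case = TRUE)
  tables <- lapply(file.path(path, rel), reader)
  names(tables) <- rel
  tables
}

# Demo with made-up files in a temporary folder (not NHANES data).
demo_dir <- file.path(tempdir(), "read_folder_demo")
dir.create(file.path(demo_dir, "2012"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_dir, "2014"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, vit_c = c(60, 75, 90)),
          file.path(demo_dir, "2012", "day1.csv"), row.names = FALSE)
write.csv(data.frame(id = 4:5, vit_c = c(55, 80)),
          file.path(demo_dir, "2014", "day1.csv"), row.names = FALSE)

raw <- read_folder(demo_dir)
names(raw)
```

Naming each table by its relative path keeps track of which folder a file came from, which helps when the same file name appears in several folders.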
Data Cleaning
This was really annoying. It didn’t like the column names even when they were the same, and then it wasn’t assigning values to the proper places. But it works now, and it’s cool.
startTime <- Sys.time()
data_cleanishD1 <- data.frame()
yearHold <- seq(2012, 2020, 2)
for (i in seq_along(Day1Raw)) {
  # Files with more than 166 columns have the nutrient columns shifted by two
  if (length(Day1Raw[[i]]) > 166) {
    tempdata <- Day1Raw[[i]][, c(1, 18:101)]
    tempdata$Year <- yearHold[i]
  } else { # 15:29 Different Diet Data
    tempdata <- Day1Raw[[i]][, c(1, 16:99)]
    tempdata$Year <- yearHold[i]
  }
  # Testing, I'm going to lose it if this fixes it
  # Reuse the column names from the combined data so rbind() lines the columns up
  if (i > 1) {
    colnames(tempdata) <- coltemp
  }
  print("Working")
  data_cleanishD1 <- rbind(data_cleanishD1, tempdata)
  coltemp <- colnames(data_cleanishD1)
}
#ComCorFullHold <- is.na(ComCorFull)
#ComCorFull[ComCorFullHold] <- 0
#corrplot(ComCorFull, method = "color", tl.cex = .6, order = "hclust", addrect = 6, rect.lwd = 1.5)
#Does not work, possible binary variables end up blank.
#From here I worked to remove variables which were irrelevant or showed the same thing. For example ID and year both go up and would be nearly perfectly correlated. Also removed a lot of fats as they are not needed, I kept Omega-3 DHA and ALA
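One possible cause of blank cells in a correlation plot is a column that never varies, since its correlation with everything else is undefined (NA). The sketch below, which assumes that is what happened, drops such columns before computing the matrix. The data frame `df` and its columns are made up for illustration.

```r
# Made-up data: `flag` is constant, so its correlations would be NA.
df <- data.frame(
  a    = c(1, 2, 3, 4, 5),
  b    = c(2, 4, 5, 4, 6),
  flag = c(1, 1, 1, 1, 1)
)

# Keep only columns whose standard deviation is defined and positive.
sds  <- vapply(df, sd, numeric(1), na.rm = TRUE)
keep <- !is.na(sds) & sds > 0

# Pairwise-complete correlations on the remaining columns.
cor_mat <- cor(df[, keep, drop = FALSE], use = "pairwise.complete.obs")
print(cor_mat)
```

A matrix built this way has no NA cells from constant columns, so it should not need the fill-with-zero step before being passed to a plotting function such as `corrplot()`.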
This is where I started actually looking more closely at the data.
Plotting Creation
I played around with this for a while, but the colors never really produced a gradient that I liked and that was useful. The colors are not pretty, but they are still better than green -> orange -> red, which looked really bad.